Background: RNA-seq is now widely used to quantitatively assess gene expression, expression differences and isoform switching, and promises to deliver results for the entire transcriptome. However, whether the transcriptional state of a gene can be captured accurately depends critically on library preparation, read alignment, expression estimation and the tests for differential expression and isoform switching. Comparisons are available for the individual steps, but there is not yet a systematic investigation of which specific genes are impacted by biases throughout the entire analysis workflow. It is especially unclear whether, for a given gene, expression changes and isoform switches can be detected with current methods and protocols.

Results: For the human genes, we report their detectability under various conditions using different approaches. Overall, we find that the input material has the biggest influence and may, depending on the protocol and RNA degradation, already exhibit strong length-dependent over- and underrepresentation of transcripts. For 50% of the isoforms, the alignment step aligns up to 99% of the reads correctly; only in the presence of transcript modifications do mainly short isoforms have a low alignment rate. In our dataset, we found that, depending on the aligner and the input material used, the expression estimates of up to 93% of the genes are accurate within a factor of two, with the deviations being due to ambiguous alignments. Detection of differential expression using a negative binomial count model works reliably for our simulated data but depends on the count accuracy. Interestingly, using the fold-change instead of the p-value as a score for differential expression yields the same performance in the situation of three replicates and a true change of two-fold. Isoform switching is harder to detect, and for at least 109 genes the isoform differences evade detection independent of the method used.

Conclusions: RNA-seq is a reliable tool, but the repetitive nature of the human genome makes the origin of the reads ambiguous and limits the detectability for certain genes. RNA-seq does not represent isoforms equally well independent of their size, which may range from ~200 nt to ~100,000 nt. Researchers are advised to verify that their target genes do not have extreme properties with respect to repeated regions, GC content, and isoform length and complexity.
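
To make the "accurate within a factor of two" criterion from the Results concrete, the following is a minimal sketch of how such a fraction could be computed from paired estimated and true expression values. The function name `fraction_within_fold`, the choice to skip genes with zero true expression, and the example numbers are assumptions for illustration only; they are not taken from the study's own pipeline or data.

```python
def fraction_within_fold(estimated, truth, fold=2.0):
    """Return the fraction of genes whose estimate lies within `fold` of the truth.

    Hypothetical helper: a gene counts as accurate when
    truth / fold <= estimate <= truth * fold.
    Genes with a true value of zero or less are skipped (an assumption),
    because a fold ratio is undefined for them.
    """
    if len(estimated) != len(truth):
        raise ValueError("estimated and truth must have the same length")

    accurate = 0
    evaluated = 0
    for est, tru in zip(estimated, truth):
        if tru <= 0:
            continue  # ratio undefined; not evaluated
        evaluated += 1
        if tru / fold <= est <= tru * fold:
            accurate += 1

    # No evaluable genes: report NaN instead of dividing by zero
    return accurate / evaluated if evaluated else float("nan")


# Made-up toy values for four genes (not data from the paper)
toy_estimated = [10.0, 25.0, 5.0, 100.0]
toy_truth = [10.0, 10.0, 10.0, 60.0]
toy_fraction = fraction_within_fold(toy_estimated, toy_truth)
print(toy_fraction)
```

In this toy input the second gene is overestimated 2.5-fold and is therefore the only one counted as inaccurate. Applied per aligner and per input material, a measure of this kind would give one accuracy percentage per combination.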